SILA: a system for scientific image analysis

Many of the images found in scientific publications are retouched, reused, or composed to enhance the quality of the presentation. In most instances, these edits are benign and help the reader better understand the material in a paper. However, some edits constitute scientific misconduct and undermine the integrity of the presented research. Determining the legitimacy of edits made to scientific images is an open problem that no current technology can solve satisfactorily in a fully automated fashion. It thus remains up to human experts to inspect images as part of the peer-review process. Nonetheless, image analysis technologies promise to become helpful aids for experts performing this essential yet arduous task. We therefore introduce SILA, a system that makes image analysis tools available to reviewers and editors in a principled way. Further, SILA is the first human-in-the-loop, end-to-end system that starts by processing article PDF files, performs image manipulation detection on the automatically extracted figures, and ends with image provenance graphs expressing the relationships between the images in question, to explain potential problems. To assess its efficacy, we introduce a dataset of scientific papers from around the globe containing annotated image manipulations and inadvertent reuse, which can serve as a benchmark for the problem at hand. Qualitative and quantitative results of the system are reported using this dataset.

Fig. S1. Architecture of SILA. We adopted a client-server model, with frontend components focused on usability and backend components focused on scalability and extensibility. New forensic solutions can easily be added to the system if they are made available as Forensic Containers. Italic labels at the bottom of the components detail the technologies used in SILA to implement them. Arrows express the data flow between components.
Fig. S1 depicts the architecture of SILA, divided into frontend and backend components. In the frontend, our main goal was usability. Its central piece is the Graphical User Interface (GUI) Server, whose aim is to provide an intuitive interface that most people are familiar with. We therefore proposed a web-based GUI with rich graphical input and output components (such as web pages, buttons, and selection boxes). These components can be rendered on the client side by any currently popular web browser, with no need for analysts to install specialized software. To use the system, all they need to do is access the web address of the application.
In the backend, our main goals were scalability and extensibility. Interaction with the frontend is centralized at the Application Programming Interface (API) Server, the component responsible for making calls to the forensic tools available in the system. To ensure scalability, we designed the API Server to follow the Representational State Transfer (REST) architectural style [1]. For simple workloads, such as the panel segmentation of a single selected image, the API Server may trigger the forensic tool of interest directly, by making a stateless call to it. For more complex scenarios, such as copy-move detection over hundreds of images, we designed the Queuing Server, a middleware responsible for managing the various requests from active clients and dispatching them to the proper available Forensic Containers.
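To make the two dispatch paths concrete, the following minimal sketch contrasts a direct stateless call with a queued batch, assuming a FastAPI application backed by an RQ/Redis queue; these framework choices, routes, and function names are illustrative, not SILA's documented stack.

# Hypothetical sketch of the API Server's dispatch logic (FastAPI and
# RQ/Redis are assumptions; the paper does not name SILA's exact stack).
from fastapi import FastAPI
from redis import Redis
from rq import Queue

app = FastAPI()
job_queue = Queue("forensics", connection=Redis())

def segment_panels(image_id: str) -> dict:
    """Placeholder for a stateless call into a Forensic Container."""
    return {"image_id": image_id, "panels": []}

@app.post("/segment/{image_id}")
def segment(image_id: str):
    # Simple workload: call the forensic tool directly and return the result.
    return segment_panels(image_id)

@app.post("/copy-move/batch")
def copy_move_batch(image_ids: list[str]):
    # Complex workload: enqueue one job per image; workers attached to the
    # Queuing Server pull jobs and forward them to available containers.
    jobs = [job_queue.enqueue("worker.detect_copy_move", i) for i in image_ids]
    return {"job_ids": [j.id for j in jobs]}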
Containers are the tool we used to ensure extensibility. We implemented each task from the proposed workflow as an independent Forensic Container: a standalone executable piece of software that contains everything needed to run it, including libraries, configurations, and even a specific operating-system environment. By using tools such as Docker [2], novel and better forensic tools can easily be added to the system, as long as they are bundled as containers.
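As an illustration, a Forensic Container bundled as a Docker image could be launched from the backend with the Docker SDK for Python, as sketched below; the image name, command-line flags, and mount paths are hypothetical.

# Hypothetical example of running a bundled forensic tool via the Docker SDK.
import docker

client = docker.from_env()
logs = client.containers.run(
    image="sila/forensic-copy-move:latest",   # assumed container name
    command=["--input", "/data/figure_017.png", "--output", "/data/masks/"],
    volumes={"/srv/sila/data": {"bind": "/data", "mode": "rw"}},
    remove=True,  # discard the container after the run; outputs persist on the mount
)
print(logs.decode())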
Lastly, we added a Data Server to the application, which is responsible for persisting all the data either ingested into the system (such as the questioned paper PDF files) or generated by it (such as the extracted images, captions, panels, manipulation masks, and provenance graphs). The Data Server is a resource shared between the API Server and the Forensic Containers.

CONTENT EXTRACTION
Here we provide more details of content extraction through its two constituent sub-tasks, namely (i) Image Extraction and (ii) Image Caption Extraction.

Fig. S2. Image extraction workflow. The entire process is automatic: it starts from a given PDF file and ends with a set of images and their respective metadata. Possible errors due to an irregular PDF generation process are mitigated through the extraction tasks depicted as rounded rectangles.

A. Image Extraction
Metric details: To evaluate the quality of SILA's image extraction solution, we extracted images from a selection of PDF files whose original figures were available from the respective publishers' websites (it is common practice for scientific publishers to provide downloadable high-quality versions of a paper's figures on their websites). The idea was to report how many of the original images were successfully recovered by the system. This metric is called image recall (IR ∈ [0..1], see Eq. S1). It is computed from the total numbers of available and effectively matched original figures, and we wanted it as close to one as possible.
To compute IR, we needed to compare pairs of images (one coming from the publisher's website and the other from SILA's extraction module) and decide whether they represented the same content (i.e., whether they were a match). Rather than doing this manually, we adopted an automatic approach: for each publisher's website image, we found the most similar image extracted by SILA from the same paper, based on a pairwise image similarity score S. Two images were a match if their score S was above a threshold τ_S. Considering that matching figures often present differences in resolution and translation (since the original figures are usually provided on the publisher's website at a higher resolution, with differences in the outside borders and cropping), we selected a similarity metric that is well known for being invariant to scaling and translation, namely the Complex-Wavelet Structural Similarity (CW-SSIM) index [3]. Moreover, based on empirical analyses, we set τ_S = 0.7 and required S > τ_S for a match.
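A minimal sketch of the IR computation follows; cw_ssim stands in for an actual CW-SSIM implementation (assumed to be provided, not a specific library call), and both image lists are assumed to come from the same paper.

# Sketch of the image-recall computation (Eq. S1).
def image_recall(publisher_images, extracted_images, cw_ssim, tau_s=0.7):
    """IR = matched originals / available originals."""
    matched = 0
    for original in publisher_images:
        # Find the most similar image extracted by SILA from the same paper.
        best_score = max(cw_ssim(original, e) for e in extracted_images)
        if best_score > tau_s:  # two images match if S > tau_S
            matched += 1
    return matched / len(publisher_images)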
Operation details: Files in Portable Document Format (PDF) comprise binary streams of objects (such as text, annotations, fonts, and images) constituting the body of the document, plus cross-reference tables (i.e., Xref tables) whose purpose is to index these objects. As a consequence, PDF image streams can be found through the Xref tables, which define the position, compression type, format, and other aspects of each embedded image. In the case of SILA, we proposed to rely on the Xref tables to extract the embedded images directly from the PDF stream.
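As an illustration, Xref-based extraction can be prototyped with the PyMuPDF library; this is an assumed tool choice for the sketch, not necessarily SILA's implementation.

# Sketch of Xref-table-based image extraction using PyMuPDF (assumed library).
import fitz  # PyMuPDF

def extract_images(pdf_path):
    results = []
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for item in page.get_images(full=True):
            xref = item[0]                      # Xref number of the image stream
            image = doc.extract_image(xref)     # raw bytes plus format metadata
            rects = page.get_image_rects(item)  # placements of the image on the page
            bbox = tuple(rects[0]) if rects else None  # (x0, y0, x1, y1)
            results.append({
                "page": page_number,
                "bbox": bbox,
                "ext": image["ext"],
                "bytes": image["image"],
            })
    return results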
While extracting images from a series of PDF streams, we occasionally found four types of problems that impair the quality of the images, often rendering them unsuitable for further analysis. The first one (transparency band error) concerns images embedded with an additional transparency band. In such cases, instead of a single image, two figures are wrongly extracted from the stream, each individually incomplete but with complementary appearance. We fixed this issue by automatically combining their content into a single element.
In a similar fashion, the second problem (multiple images error) occurs when a single image is embedded, for whatever reason, as multiple objects, each containing a different portion of the original figure. We fixed this by automatically concatenating the portions into a single image.
The third problem (multiple copies error), in turn, stems from faulty PDF generation, where the same image is embedded many times at the same page position, with overlaps that go unnoticed in regular PDF rendering. As a consequence, a single image yields various copies of itself in a typical extraction process. We fixed this problem by keeping only one copy.
Lastly, the fourth problem (scanned pages error) refers to PDF documents generated from scanned documents, whose entire pages are embedded as single images. As a result, a PDF document with n pages is actually encoded as n images, one per page. For such cases, we developed an additional routine that detects this situation and, instead of applying the typical Xref-table-based image extraction, executes a vision-based page segmentation similar to [4]. Fig. S2 summarizes the image extraction process proposed in this work. The workflow starts with a given PDF file and ends with a set of extracted images and their respective metadata. For each obtained image, the metadata consist of the number of the PDF page containing the image, as well as the (x_0, y_0, x_1, y_1) bounding box that encloses its content.

B. Image Caption Extraction
Metric details: Two metrics were used to assess the integrity of the extracted captions, by comparing them with their respective image-wise ground-truth counterparts. On the one hand, the normalized Levenshtein distance [5] (LD ∈ [0..1]) measured how different ground-truth and extracted captions were at the character level; being a distance, we wanted LD as close to zero as possible. On the other hand, BERTScore [6] (BS ∈ [0..1]) measured the semantic similarity between ground-truth and extracted captions, through the SciBERT word-embedding model [7] (chosen because it was trained on scientific text); we wanted BS as close to one as possible.
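A sketch of how both metrics could be computed follows, assuming the python-Levenshtein and bert-score packages; these package choices, and the layer setting for SciBERT, are illustrative assumptions.

# Sketch of the two caption-integrity metrics (assumed package choices).
import Levenshtein
from bert_score import score

def normalized_levenshtein(ground_truth: str, extracted: str) -> float:
    # LD in [0..1]: 0 means identical strings at the character level.
    return Levenshtein.distance(ground_truth, extracted) / max(
        len(ground_truth), len(extracted), 1
    )

def bert_scores(ground_truths, extracted):
    # BS in [0..1]: semantic similarity under the SciBERT embedding model.
    _, _, f1 = score(
        extracted, ground_truths,
        model_type="allenai/scibert_scivocab_uncased",
        num_layers=9,  # layer choice is an assumption; tune as needed
    )
    return f1.tolist()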
Operation details: SILA retrieves captions from the PDF text and associates them with the images obtained during image extraction. To do so, the system extracts all paragraphs from the PDF stream with their respective metadata. As in the case of images, paragraph metadata comprise the text page number and the (x_0, y_0, x_1, y_1) page bounding box. To select only image captions, we considered only the paragraphs that started with keywords such as "Figure", "Fig.", or other strings properly set in the system's configuration. To associate images and captions, we relied on the matching of image and caption metadata. To cope with the large variety of scientific paper layouts, we adopted two simple yet robust association rules. Namely, (i) an image and its caption should be on the same page (i.e., they should have the same metadata page number), and (ii) they should be as close as possible in the page's Euclidean object-placement space, with respect to their bounding boxes. A greedy algorithm was used to attribute one caption to each image, following these rules.
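The following sketch illustrates the two rules and one greedy attribution strategy; the dictionary layout of the metadata is assumed for illustration.

# Sketch of greedy image-caption association under the two rules:
# same page, then minimal Euclidean distance between bounding-box centers.
import math

def center(bbox):
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def associate(images, captions):
    """images/captions: lists of dicts with 'page' and 'bbox' keys."""
    pairs, used = [], set()
    for img in images:
        candidates = [
            (math.dist(center(img["bbox"]), center(cap["bbox"])), j)
            for j, cap in enumerate(captions)
            if cap["page"] == img["page"] and j not in used  # rule (i)
        ]
        if candidates:
            _, j = min(candidates)  # rule (ii): closest caption wins
            used.add(j)
            pairs.append((img, captions[j]))
    return pairs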

PANEL SEGMENTATION
Metric details:
To measure the quality of the panel segmentation module, we used the intersection-over-union score (IoU ∈ [0..1], see Eq. S2) to compare ground-truth segmentation masks (gt_mask) and the respective panel masks produced by the segmentation solution (sl_mask), at the pixel level. As expected, the higher the IoU value, the better the solution.
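A minimal pixel-level IoU sketch, assuming boolean NumPy masks of equal shape:

# Pixel-level IoU between a ground-truth mask and a solution mask (Eq. S2).
import numpy as np

def iou(gt_mask: np.ndarray, sl_mask: np.ndarray) -> float:
    intersection = np.logical_and(gt_mask, sl_mask).sum()
    union = np.logical_or(gt_mask, sl_mask).sum()
    return float(intersection / union) if union else 1.0  # empty masks: define as 1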

IMAGE RANKING
Metric details: By comparing computed and ground-truth image ranks, we reported the performance of image ranking in terms of precision at the top-N retrieved images (P@N, with N ∈ {1, 5, 10}, see Eq. S3), averaged over a selection of query images. The idea was to assess SILA's image ranking module as a recommender subsystem: the higher the precision values, the better the recommendations.
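A minimal sketch of P@N, assuming the retrieved rank and the set of relevant (ground-truth) images are given as identifiers:

# Precision at top-N (Eq. S3): fraction of the first N retrieved gallery
# images that are relevant according to the ground truth.
def precision_at_n(retrieved_ids, relevant_ids, n):
    relevant = set(relevant_ids)
    return sum(1 for i in retrieved_ids[:n] if i in relevant) / n

# Averaged over a selection of query images, for N in {1, 5, 10}.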
Operation details: Previous works have tried to classify images from scientific papers [8][9][10], but a consensus on the classes needed to organize elements (such as charts, diagrams, illustrations, etc.) has yet to emerge. This problem gets even harder when one takes into account the different scientific disciplines and their specific needs, ranging, for instance, from Mathematics (which needs support for equations) to Medicine (which needs support for specialized image acquisition such as magnetic resonance imaging), and beyond to Biology (which needs support for a variety of types of microscopy). Aiming to develop a more general-purpose workflow that fits different domains, we replaced the classification approach with a visual similarity one. Given an image of interest, namely the query, and a set of images previously extracted from a selection of PDF files, namely the gallery, SILA sorts the gallery elements from the most to the least similar to the query, regardless of their type, to present to the analyst. Fig. S3 depicts an example of the ranked images ideally retrieved for a given query. To accomplish this task, we relied on techniques from the field of Content-Based Image Retrieval (CBIR). Typical CBIR solutions reduce the semantic gap between the image pixel values and the system's purpose through a multi-level representation of the images. In the case of SILA, the purpose was to sort gallery images, preferably retrieving exact copies and near-duplicates in the first positions, followed by semantically similar images, followed by unrelated content. We wanted exact copies and near-duplicates first because they constitute key elements in the process of detecting image reuse and splicing, since they come from the same imaging pipeline as the query. Semantically similar images, in contrast, are visually similar to the query (e.g., pictures of the same type of cell) but come from different devices or capture points. Their retrieval ahead of unrelated images is still desired, though, since they may help the analyst spot problems such as idea plagiarism.
Given our preference for near-duplicates over semantically similar images, we chose to match local features (i.e., interest points) [11][12][13][14] in the lowest levels of the CBIR representation. Contrary to global features such as perceptual hashing [15], which embed the entire image content as a single feature vector, local features aim at detecting and describing multiple regions of interest within the image content that are tolerant to color and spatial transformations (e.g., rotations, scaling). As a consequence, whenever the preceding image content extraction step was executed, all the resulting images were immediately described with multiple local features, which were then indexed with an inverted-file approach [16] (i.e., each multi-dimensional feature vector was stored along with the identification number of its source image). This inverted-file index constituted the gallery information. During content ranking, whenever a query was selected, each one of its feature vectors was used to retrieve the k-nearest gallery feature vectors within the n-dimensional space they constituted. Since each gallery vector pointed back to its source image, close gallery images could be traced back and receive two types of votes: (i) one based on the position p ∈ [1, k] of the gallery feature within the k retrieved features, and (ii) one based on the distance d ≥ 0 of the gallery feature to the query feature. The position-based votes were weighted by the inverse of p (i.e., the smaller the value of p, the more relevant the vote to the respective gallery image). In a similar fashion, the distance-based votes were weighted by the inverse of d. Once all the query feature vectors were processed, the votes were summed and min-max normalized in a gallery-image-wise manner, for each of the two vote types. Consequently, each gallery image received two votes, namely V_pos ∈ [0..1] (related to p) and V_dist ∈ [0..1] (related to d). These votes were then linearly combined into a single score, allowing the gallery images to be sorted from the most to the least voted. This sorted list became the query-wise image rank, which was the desired output of the image ranking task.
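The voting scheme can be summarized in code as follows; index.search is a hypothetical k-NN interface over the inverted file, and the combination weight alpha is an assumption (the text only states that the two votes are linearly combined).

# Sketch of the two-vote ranking scheme (position- and distance-based votes).
from collections import defaultdict

def rank_gallery(query_features, index, k=10, alpha=0.5):
    v_pos, v_dist = defaultdict(float), defaultdict(float)
    for feature in query_features:
        # Each hit carries the source-image id, its 1-based position p in the
        # k-NN list, and its distance d to the query feature.
        for p, (image_id, d) in enumerate(index.search(feature, k), start=1):
            v_pos[image_id] += 1.0 / p            # position-based vote
            v_dist[image_id] += 1.0 / (d + 1e-9)  # distance-based vote
    def minmax(votes):
        lo, hi = min(votes.values()), max(votes.values())
        span = (hi - lo) or 1.0
        return {i: (v - lo) / span for i, v in votes.items()}
    v_pos, v_dist = minmax(v_pos), minmax(v_dist)
    scores = {i: alpha * v_pos[i] + (1 - alpha) * v_dist[i] for i in v_pos}
    return sorted(scores, key=scores.get, reverse=True)  # most- to least-voted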

COPY-MOVE DETECTION
Metric details: To compare the ground-truth manipulation masks with the copy-move detection masks computed by SILA, we followed the setup proposed in [17]. Therefore, we relied on the pixel-wise F_1-score metric, defined as:

F_1 = 2 × #TP / (2 × #TP + #FP + #FN). (S4)

In this case, #TP was the number of corresponding pixels that were white on both masks; #FP was the number of corresponding pixels that were black on the ground-truth mask but white on the solution-based one; and #FN was the number of corresponding pixels that were white on the ground-truth mask but black on the solution-based one. By definition, we wanted F_1 ∈ [0..1] to be as close to one as possible.
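A minimal sketch of the pixel-wise F_1 computation, assuming boolean NumPy masks where white (True) marks duplicated regions:

# Pixel-wise F1 between ground-truth and detected copy-move masks (Eq. S4).
import numpy as np

def f1_score(gt_mask: np.ndarray, sl_mask: np.ndarray) -> float:
    tp = np.logical_and(gt_mask, sl_mask).sum()   # white on both masks
    fp = np.logical_and(~gt_mask, sl_mask).sum()  # white only on the solution
    fn = np.logical_and(gt_mask, ~sl_mask).sum()  # white only on the ground truth
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 1.0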

Literature comparison:
In Table S1, we report the results of copy-move detection over the same scientific images obtained by different approaches from the forensic literature. Within this group, there were methods specialized in highlighting noise inconsistencies [18][19][20], methods that tackled the problem of copy-move detection with classic content-matching approaches such as sparse or dense local descriptors [17,21,22], and more recent methods relying on deep-learning-based strategies [23,24]. As one might observe, all of these methods performed far below our proposed solutions (represented by the last four rows of Table S1). Indeed, the least effective ones were those based on either noise residuals (which were probably not suitable for low-resolution images such as the ones published in scientific papers) or deep learning (whose models were trained only on natural-scene images). Better results were achieved by methods that used interest points and block-based local descriptors devised for visual content duplication detection. Nonetheless, even these results were below the performance of our approaches.

Ablation study:
We carried out an ablation study to better understand the relevance of some of the steps included in our copy-move detection algorithm. First, we considered a basic version relying on Zernike moments, which included neither the panel segmentation step nor the sub-panel pairwise matching strategy (namely, Basic Zernike in Table S1). After that, we considered two modified versions, both of which included the panel segmentation and sub-panel pairwise analysis steps proposed in this work. One was based on Zernike moments (namely, Zernike), and the other was based on the data from the image RGB channels (namely, RGB). Both improved over the results obtained with Basic Zernike. Moreover, while Zernike features and RGB data individually provided almost comparable results, their fusion boosted performance (see Zernike + RGB), yielding the best copy-move detection masks. This is the approach currently adopted by SILA.

PROVENANCE ANALYSIS
Metric details: To evaluate SILA's capabilities of building provenance graphs, we followed the experimental setup proposed in [25]. Therefore, to compare a ground-truth provenance graph G'(V', E') and its corresponding graph G(V, E) computed by SILA, we used the harmonic mean of precision and recall (i.e., the F_1-score) with respect to two key concepts, namely (i) the obtained image vertices V and their V' counterparts, and (ii) the established edges E and their E' counterparts.
The first metric, focused on comparing the vertices, is called vertex overlap (VO ∈ [0..1]):

VO(G', G) = 2 × |V' ∩ V| / (|V'| + |V|), (S5)

and we wanted it as close to one as possible. The second one, focused on comparing the edges, is called edge overlap (EO ∈ [0..1]):

EO(G', G) = 2 × |E' ∩ E| / (|E'| + |E|), (S6)

and we also wanted it as close to one as possible. Additionally, we computed the F_1-score of retrieving both vertices and edges simultaneously, dubbed vertex and edge overlap (VEO ∈ [0..1]):

VEO(G', G) = 2 × (|V' ∩ V| + |E' ∩ E|) / (|V'| + |V| + |E'| + |E|). (S7)

We computed these three metrics for each available (G', G) pair and reported their average values. In summary, these metrics assess the overlap between each ground-truth graph G' and its corresponding computed graph G. Lastly, in the particular case of EO and VEO, we adopted the undirected-graph setup [26], where an edge within E is considered a hit whenever there is a homologous edge within E' that connects equivalent vertices, regardless of the edges' orientations.
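A minimal sketch of the three overlap metrics, with edges compared as unordered vertex pairs to honor the undirected-graph setup:

# VO, EO, and VEO (Eqs. S5-S7); vertices are image identifiers and edges are
# vertex pairs, compared as orientation-agnostic sets.
def overlaps(gt_vertices, gt_edges, vertices, edges):
    V_gt, V = set(gt_vertices), set(vertices)
    # frozenset makes edge comparison ignore edge orientation.
    E_gt = {frozenset(e) for e in gt_edges}
    E = {frozenset(e) for e in edges}
    vo = 2 * len(V_gt & V) / (len(V_gt) + len(V))
    eo = 2 * len(E_gt & E) / (len(E_gt) + len(E))
    veo = 2 * (len(V_gt & V) + len(E_gt & E)) / (
        len(V_gt) + len(V) + len(E_gt) + len(E)
    )
    return vo, eo, veo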